train_pgo: Add json and avro training#29487
Conversation
There was a problem hiding this comment.
Pull request overview
This PR enhances PGO (Profile-Guided Optimization) training by adding Avro and JSON schema support alongside the existing Protobuf training. This helps optimize Redpanda's performance when Iceberg is used with these different serialization formats.
Changes:
- Added Avro and JSON schema definitions with corresponding sample payloads
- Refactored the Protobuf-specific setup function into a generic schema setup function that supports multiple formats
- Added a new function to send messages via rpk for testing Avro and JSON schemas
- Updated topic names to be format-specific and fixed hardcoded references
Iceberg generally still a massive performance concern. Adding avro training adds a another perf bump to help with that. We do a very minimal rpk based training which seems to result in good enough training coverage.
c5c32b6 to
92eb520
Compare
tools/pgo_bolt/train_pgo.py
Outdated
| }""" | ||
| JSON_TOPIC_NAME = "iceberg-json-topic" | ||
| JSON_SAMPLE_PAYLOAD = json.dumps( | ||
| {"name": "hello my name is json shady", "id": 13579, "ts": 1625079045123456} |
There was a problem hiding this comment.
put an array and null in there so we hit those paths?
There was a problem hiding this comment.
I could add it but it seems make very little difference either way. E.g.: in our datalake omb test we use a message with only string fields and just training on a single integer id field results in same perf as what is shown above. So it seems to not make much difference. Whatever you prefer.
There was a problem hiding this comment.
"So it seems to not make much difference."
The effort to do it is close to zero, so yes I think we should. We always suffer from the problem of our tests being much narrower than real world scenarios so we need to augment our tests (which unfortunately are very similar between training and validation) with our judgment and guesses: imagine someone has a schema which is mostly one giant array. Then it may matter.
Similar to the avro training also add a json equivalent.
92eb520 to
af44bed
Compare
Add Avro and JSON training to further help with iceberg perf when those are in use.
Backports Required
Release Notes